Overall Direction
TODO
Choice between continent and continentTopFive as first categorical
Compare BoxCox optimal versus log(points) transform of response
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 4864.1, df = 13, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod), 5000)
## W = 0.99665, p-value = 3.615e-09
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 4864.1, df = 13, p-value < 2.2e-16
##
## Call:
## glm(formula = boxcoxTrans(points, optimal_lamda, 0) ~ log(price) *
## continentTopFive, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -133.562 -14.671 0.719 15.653 91.658
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 463.0775 3.1920 145.075
## log(price) 25.5649 1.0856 23.550
## continentTopFiveAsia 10.7109 6.8207 1.570
## continentTopFiveAustralia -5.0864 3.5616 -1.428
## continentTopFiveEuropeTop5 -2.6382 3.2394 -0.814
## continentTopFiveNorthAmerica -19.4814 3.2543 -5.986
## continentTopFiveOtherEurope -8.6811 3.9573 -2.194
## continentTopFiveSouthAmerica -14.8069 3.4298 -4.317
## log(price):continentTopFiveAsia -6.6737 2.1369 -3.123
## log(price):continentTopFiveAustralia 1.4512 1.1940 1.215
## log(price):continentTopFiveEuropeTop5 0.6644 1.0980 0.605
## log(price):continentTopFiveNorthAmerica 4.1913 1.1016 3.805
## log(price):continentTopFiveOtherEurope 3.0472 1.3077 2.330
## log(price):continentTopFiveSouthAmerica 2.8761 1.1716 2.455
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(price) < 2e-16 ***
## continentTopFiveAsia 0.116336
## continentTopFiveAustralia 0.153265
## continentTopFiveEuropeTop5 0.415405
## continentTopFiveNorthAmerica 2.15e-09 ***
## continentTopFiveOtherEurope 0.028259 *
## continentTopFiveSouthAmerica 1.58e-05 ***
## log(price):continentTopFiveAsia 0.001790 **
## log(price):continentTopFiveAustralia 0.224211
## log(price):continentTopFiveEuropeTop5 0.545123
## log(price):continentTopFiveNorthAmerica 0.000142 ***
## log(price):continentTopFiveOtherEurope 0.019793 *
## log(price):continentTopFiveSouthAmerica 0.014092 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 548.6045)
##
## Null deviance: 91898757 on 102921 degrees of freedom
## Residual deviance: 56455794 on 102908 degrees of freedom
## AIC: 941264
##
## Number of Fisher Scoring iterations: 2
## [1] 90.07612 85.07601 94.94174
## [1] 0.0181028
## [1] 548.6591
## [1] 10.96185
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 3905.5, df = 13, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod1), 5000)
## W = 0.99545, p-value = 2.185e-11
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 3905.5, df = 13, p-value < 2.2e-16
##
## Call:
## glm(formula = log(points) ~ log(price) * continentTopFive, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.166397 -0.018153 0.001614 0.019576 0.102402
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 4.3764767 0.0039361 1111.887
## log(price) 0.0315053 0.0013386 23.536
## continentTopFiveAsia 0.0120168 0.0084106 1.429
## continentTopFiveAustralia -0.0048292 0.0043919 -1.100
## continentTopFiveEuropeTop5 -0.0013714 0.0039945 -0.343
## continentTopFiveNorthAmerica -0.0229106 0.0040129 -5.709
## continentTopFiveOtherEurope -0.0095902 0.0048798 -1.965
## continentTopFiveSouthAmerica -0.0192454 0.0042294 -4.550
## log(price):continentTopFiveAsia -0.0079028 0.0026350 -2.999
## log(price):continentTopFiveAustralia 0.0012709 0.0014723 0.863
## log(price):continentTopFiveEuropeTop5 0.0001409 0.0013539 0.104
## log(price):continentTopFiveNorthAmerica 0.0046728 0.0013583 3.440
## log(price):continentTopFiveOtherEurope 0.0033206 0.0016125 2.059
## log(price):continentTopFiveSouthAmerica 0.0037301 0.0014447 2.582
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(price) < 2e-16 ***
## continentTopFiveAsia 0.153077
## continentTopFiveAustralia 0.271515
## continentTopFiveEuropeTop5 0.731350
## continentTopFiveNorthAmerica 1.14e-08 ***
## continentTopFiveOtherEurope 0.049384 *
## continentTopFiveSouthAmerica 5.36e-06 ***
## log(price):continentTopFiveAsia 0.002708 **
## log(price):continentTopFiveAustralia 0.388032
## log(price):continentTopFiveEuropeTop5 0.917124
## log(price):continentTopFiveNorthAmerica 0.000582 ***
## log(price):continentTopFiveOtherEurope 0.039468 *
## log(price):continentTopFiveSouthAmerica 0.009825 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0008341881)
##
## Null deviance: 138.196 on 102921 degrees of freedom
## Residual deviance: 85.845 on 102908 degrees of freedom
## AIC: -437523
##
## Number of Fisher Scoring iterations: 2
## fit lwr upr
## 1 90.00091 85.00703 95.28817
## [1] 0.03654621
## [1] 0.0008342488
## [1] 1.029305
LDA
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## [2] Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
## # A tibble: 65,302 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 100 4.663715e-04
## 2 2 100 7.081641e-04
## 3 1 20222030 6.438382e-07
## 4 2 20222030 1.116309e-06
## 5 1 age 4.351469e-03
## 6 2 age 2.740947e-03
## 7 1 ahead 3.675568e-05
## 8 2 ahead 7.825068e-05
## 9 1 background 2.853492e-04
## 10 2 background 1.755885e-04
## # ... with 65,292 more rows
## # A tibble: 267 x 4
## term topic1 topic2 log_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 accent 0.000597273 2.168703e-03 1.86037010
## 2 acid 0.019160079 1.820266e-03 -3.39588224
## 3 add 0.001084956 1.153448e-03 0.08831623
## 4 age 0.004351469 2.740947e-03 -0.66682783
## 5 alcohol 0.001293881 1.779603e-03 0.45985068
## 6 almond 0.001585721 2.607108e-04 -2.60461729
## 7 almost 0.001079766 1.911925e-03 0.82430696
## 8 along 0.001881767 5.952606e-04 -1.66049514
## 9 also 0.003160329 9.200576e-04 -1.78027848
## 10 although 0.001887905 7.860205e-06 -7.90800368
## # ... with 257 more rows
Sentiment Analysis
wine_train_lda = data.frame(wine_train, topic1 = corpus_documents$gamma[1:nrow(wine_train)], document = rownames(wine_train))
wine_train_sentiment = merge(wine_train_lda, corpus_sentiments[,c(1,4)], by = "document")
km.out = kmeans(wine_train_sentiment$sentiment, 3)
plot(wine_train_sentiment$sentiment, col=(km.out$cluster + 1))
wine_train_clustered = data.frame(wine_train_sentiment, cluster = km.out$cluster)
Adding topic1 to the model
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 3581.2, df = 28, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: sample(resid(mod2), 5000)
## W = 0.99217, p-value = 5.541e-16
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 3581.2, df = 28, p-value < 2.2e-16
##
## Call:
## glm(formula = log(points) ~ log(price) * continentTopFive * cluster +
## topic1, data = wine_train_clustered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.130445 -0.018064 0.001548 0.019782 0.102926
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 4.391240 0.012356
## log(price) 0.027842 0.003906
## continentTopFiveAsia 0.001249 0.023942
## continentTopFiveAustralia -0.015017 0.012836
## continentTopFiveEuropeTop5 -0.010464 0.011670
## continentTopFiveNorthAmerica -0.034551 0.011730
## continentTopFiveOtherEurope -0.014660 0.013949
## continentTopFiveSouthAmerica -0.033462 0.012315
## cluster -0.004750 0.004762
## topic1 -0.010365 0.009020
## log(price):continentTopFiveAsia -0.003709 0.007512
## log(price):continentTopFiveAustralia 0.005764 0.004292
## log(price):continentTopFiveEuropeTop5 0.003928 0.003947
## log(price):continentTopFiveNorthAmerica 0.009176 0.003961
## log(price):continentTopFiveOtherEurope 0.006040 0.004604
## log(price):continentTopFiveSouthAmerica 0.008932 0.004193
## log(price):cluster 0.001867 0.001614
## continentTopFiveAsia:cluster 0.006376 0.009796
## continentTopFiveAustralia:cluster 0.004392 0.005316
## continentTopFiveEuropeTop5:cluster 0.004285 0.004827
## continentTopFiveNorthAmerica:cluster 0.005156 0.004850
## continentTopFiveOtherEurope:cluster 0.003699 0.005812
## continentTopFiveSouthAmerica:cluster 0.006518 0.005096
## log(price):continentTopFiveAsia:cluster -0.002328 0.003061
## log(price):continentTopFiveAustralia:cluster -0.001958 0.001775
## log(price):continentTopFiveEuropeTop5:cluster -0.001809 0.001631
## log(price):continentTopFiveNorthAmerica:cluster -0.002107 0.001636
## log(price):continentTopFiveOtherEurope:cluster -0.001728 0.001918
## log(price):continentTopFiveSouthAmerica:cluster -0.002419 0.001733
## t value Pr(>|t|)
## (Intercept) 355.397 < 2e-16 ***
## log(price) 7.128 1.03e-12 ***
## continentTopFiveAsia 0.052 0.95838
## continentTopFiveAustralia -1.170 0.24202
## continentTopFiveEuropeTop5 -0.897 0.36992
## continentTopFiveNorthAmerica -2.946 0.00322 **
## continentTopFiveOtherEurope -1.051 0.29327
## continentTopFiveSouthAmerica -2.717 0.00659 **
## cluster -0.998 0.31852
## topic1 -1.149 0.25051
## log(price):continentTopFiveAsia -0.494 0.62146
## log(price):continentTopFiveAustralia 1.343 0.17930
## log(price):continentTopFiveEuropeTop5 0.995 0.31959
## log(price):continentTopFiveNorthAmerica 2.316 0.02053 *
## log(price):continentTopFiveOtherEurope 1.312 0.18963
## log(price):continentTopFiveSouthAmerica 2.130 0.03314 *
## log(price):cluster 1.157 0.24721
## continentTopFiveAsia:cluster 0.651 0.51514
## continentTopFiveAustralia:cluster 0.826 0.40868
## continentTopFiveEuropeTop5:cluster 0.888 0.37462
## continentTopFiveNorthAmerica:cluster 1.063 0.28779
## continentTopFiveOtherEurope:cluster 0.636 0.52452
## continentTopFiveSouthAmerica:cluster 1.279 0.20087
## log(price):continentTopFiveAsia:cluster -0.761 0.44689
## log(price):continentTopFiveAustralia:cluster -1.103 0.27015
## log(price):continentTopFiveEuropeTop5:cluster -1.109 0.26724
## log(price):continentTopFiveNorthAmerica:cluster -1.288 0.19778
## log(price):continentTopFiveOtherEurope:cluster -0.901 0.36783
## log(price):continentTopFiveSouthAmerica:cluster -1.396 0.16281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0008383856)
##
## Null deviance: 120.669 on 89226 degrees of freedom
## Residual deviance: 74.782 on 89198 degrees of freedom
## AIC: -378841
##
## Number of Fisher Scoring iterations: 2
## [1] 0.004087244
## [1] 0.0008385478
## [1] 1.029381